Design an algorithm to encode a list of strings to a string. The encoded string is then sent over the network and is decoded back to the original list of strings.
Machine 1 (sender) has the function:
string encode(vector<string> strs) {
// ... your code
return encoded_string;
}
Machine 2 (receiver) has the function:
vector<string> decode(string s) {
//... your code
return strs;
}
So Machine 1 does:
string encoded_string = encode(strs);
and Machine 2 does:
vector<string> strs2 = decode(encoded_string);
strs2 in Machine 2 should be the same as strs in Machine 1.
Implement the encode and decode methods.
You are not allowed to solve the problem using any serialize methods (such as eval).
Example 1:
Input: dummy_input = ["Hello","World"] Output: ["Hello","World"] Explanation: Machine 1: Codec encoder = new Codec(); String msg = encoder.encode(strs); Machine 1 ---msg---> Machine 2 Machine 2: Codec decoder = new Codec(); String[] strs = decoder.decode(msg);
Example 2:
Input: dummy_input = [""] Output: [""]
Constraints:
1 <= strs.length <= 2000 <= strs[i].length <= 200strs[i]contains any possible characters out of256valid ASCII characters.
Follow up: Could you write a generalized algorithm to work on any possible set of characters?
Average Rating: 4.39 (23 votes)
Solution
Approach 1: Non-ASCII Delimiter
Intuition
Naive solution here is to join strings using delimiters.
What to use as a delimiter? Each string may contain any possible characters out of 256 valid ascii characters.
Seems like one has to use non-ASCII unichar character,
for example unichr(257) in Python and
Character.toString((char)257) in Java (it's character ā).
Here it's convenient to use two different non-ASCII characters, to distinguish between situations of "empty array" and of "array of empty strings".
Implementation
Use split in Java with a second argument -1 to
make it work as split in Python.
Complexity Analysis
-
Time complexity : O(N) both for encode and decode, where N is a number of strings in the input array.
-
Space complexity : O(1) for encode to keep the output, since the output is one string. O(N) for decode keep the output, since the output is an array of strings.
Approach 2: Chunked Transfer Encoding
Pay attention to this approach because last year Google likes to ask that sort of low-level optimisation. Serialize and deserialize BST problem is a similar example.
This approach is based on the encoding used in HTTP v1.1. It doesn't depend on the set of input characters, and hence is more versatile and effective than Approach 1.
Data stream is divided into chunks. Each chunk is preceded by its size in bytes.
Encoding Algorithm
-
Iterate over the array of chunks, i.e. strings.
-
For each chunk compute its length, and convert that length into 4-bytes string.
-
Append to encoded string :
-
4-bytes string with information about chunk size in bytes.
-
Chunk itself.
-
-
-
Return encoded string.
Decoding Algorithm
-
Iterate over the encoded string with a pointer
iinitiated as 0. Whilei < n:-
Read 4 bytes
s[i: i + 4]. It's chunk size in bytes. Convert this 4-bytes string to integerlength. -
Move the pointer by 4 bytes
i += 4. -
Append to the decoded array string
s[i: i + length]. -
Move the pointer by
lengthbytesi += length.
-
-
Return decoded array of strings.
Implementation
Complexity Analysis
-
Time complexity : O(N) both for encode and decode, where N is a number of strings in the input array.
-
Space complexity : O(1) for encode to keep the output, since the output is one string. O(N) for decode keep the output, since the output is an array of strings.
Last Edit: February 29, 2020 10:25 PM
Understanding bytes = [chr(x >> (i * 8) & 0xff) for i in range(4)]
-
all you need to do is encode the length of x into 4 bytes ( why 4 bytes - integer size - 4 bytes = [8bits, 8bits, 8bits, 8bits])
-
ok so how do you get a X(length of str) total size into chunks of 8 bits ?
2.1 >> is right shift - which means if you have 101111 >> 2 - this right shift moves 101111, 2 times to the right - which
becomes 1011
2.2 if you go to python terminal and type 0xff you get 8 1's which represents 11111111 = a BYTE
2.3 if you AND(&) any number with 0xff = it gives you the right most 8 bits of the number
example: 1. bin(291) (binary of 291) is '0b100100011'
2. oxff binary is '0b11111111'
3. now if you 291 & 0xff = you get last 8 bits of 291 == 00100011
If you understand this - you understood the solution.Now as explained in the 2 point. The python solution line 6 we take the whole length of the string (len(str)) - we have to
encode that into [8bits, 8 bits, 8bits, 8bits]
example: 1. lets say our total length is 291 ( in binary its bin(291) = '0b100100011')
2. what you have to do now ? we grab the right most 8 bits - how do you grab right most 8 bits ?
2.1 as mentioned above you do 291 & 0xff = you get last 8 bits -
Now you already have right most 8 bits - in next iteration you are interested in NEXT 8 bits - how do you get
that ? we move 291 >> 8 ( which removes the last 8 bits we already computed) - which means if you do
(291 >> 8 ) & 0xff = it gives you the next right most 8 bits -
as already mentioned we need 4 chunks of 8 bits [8bits, 8bits, 8bits, 8bits]
4.1 [ for i in range(4)] thats why the range is 4 ( 0, 1, 2, 3)Once you compute all the 8bits we need to convert to char hence its [chr((x >> (i * 8)) & 0xff) for i in range(4)]
Hope this helps!
September 9, 2019 2:39 AM
what if length of string exceeds the Integer.MAX_VALUE (which means 4 butes cannot hold the length)?
November 19, 2019 12:29 AM
The space analysis is wrong here - just because a solution uses one string doesn't make it O(1) as strings have a variable size. Clearly the output of encode in both of these examples is linearly proportional to the size of the input.
September 9, 2019 11:32 AM
Chunked encoding is too clever for me... I dont think I can solve this in an interview lol
what is "BE CodecDriver error? Python 3.* version do not have the issue atleast I do not see it. Also python 3.0 version uses chr() instead of unichr
July 8, 2019 5:54 PM
HI, I am a little confused by this statement. For each chunk compute its length, and convert that length into 4-bytes string. It is actually a 8 bytes string since a char is 2 bytes (16 bits). And then the stored amount is sub-optimal, it uses 64 bits to store a 2**32 number.
August 23, 2020 7:01 AM
Isn't a char 2 bytes?
Last Edit: July 6, 2019 9:55 PM
What about just implement the full HTTP Chunked Transfer Encoding and dump length as readable text representations. If you dump integer as bytes you can easily encounter endian issues between machines and things like strict pointer aliasing violation if you are using C/C++ and are not careful enough about dark corners of the language standards (E.g. if you used an int* to dereference data stored in a char* / string, you entered the land of undefined behaviour).
One variation on the chunk encoding is to use a "header" to start the packet off with a list of str lengths and end with some sentinel. Because you strip the header off at the start of every decode you put whatever you want in there to help you out with decoding. This is sort of like what IP does. A network packet gets some header for transport that later gets stripped.
packet to encode: [abc, def, ghi] -> "3 3 3Aabcdefghi"
decoder strips off everything up till 'A' and uses that to figure out the string lengths.
Approach 1 by using escape character
class Codec:
def encode(self, strs: [str]) -> str:
return str(len(strs)) + '\t' + '\t'.join(strs)
def decode(self, s: str) -> [str]:
length, *strs = s.split('\t')
return [] if length == '0' else strs
Approach 2 by prefix length
class Codec:
def encode(self, strs: [str]) -> str:
return ''.join([f'{len(s)}:{s}' for s in strs])
def decode(self, s: str) -> [str]:
result = []
while s:
length, s = s.split(':', 1)
result.append(s[:int(length)])
s = s[int(length):]
return result
You don't have any submissions yet.
xxxxxxxxxxclass Codec {public: // Encodes a list of strings to a single string. string encode(vector<string>& strs) { } // Decodes a single string to a list of strings. vector<string> decode(string s) { }};// Your Codec object will be instantiated and called as such:// Codec codec;// codec.decode(codec.encode(strs));


